Linux semantics for fallocate() #2499

dweeezil · 2014-07-15T20:25:55Z

According to the Linux documentation, the FALLOC_FL_KEEP_SIZE flag must be
set in conjunction with the FALLOC_FL_PUNCH_HOLE flag. The previous
logic caused a failure whenever FALLOC_FL_KEEP_SIZE was present.

Also, mimic the behavior of other native file systems such as ext4 in
cases where the file might be extended. If the offset is beyond the end
of the file, return success without changing the file. If the extent of
the punched hole would extend the file, only the existing tail of the
file is punched.

References:
https://git.kernel.org/cgit/docs/man-pages/man-pages.git/tree/man2/fallocate.2#n102

dweeezil · 2014-07-15T20:26:14Z

This causes hole punching to work as it does in ext4.

kernelOfTruth · 2014-07-16T02:18:50Z

related to #326

kernelOfTruth · 2014-07-16T11:16:52Z

ok, a question:

does this mean that in practice rsync with --preallocate does use this now ?

dweeezil · 2014-07-16T11:49:51Z

No. I should have been a bit more clear in the commit message. Being copy-on-write, there is no concept of pre-allocation in ZFS. This patch makes the FALLOC_FL_PUNCH_HOLE flag work as it's supposed to (as it does in ext4). It fixes the flag check to allow it to work and adds some post-processing of the arguments to mimic the semantics of the operation in ext4.

dweeezil · 2014-07-16T12:16:39Z

OK, I see that the original patch wasn't intended to support FALLOC_FL_KEEP_SIZE which what this patch attempts to do. At least it seems to work under light testing.

dweeezil · 2014-07-16T13:37:32Z

I wrote a somewhat more comprehensive testing script: https://gist.github.com/dweeezil/e1b5d3953f4d4fefb888 which seems to the the Right Thing.

behlendorf · 2014-07-17T22:18:37Z

xfstests has a few tests for this as well. I'll run it through and see how it does.

behlendorf · 2014-07-17T23:52:29Z

This patch causes xfstests/127 to fail.

127  - output mismatch (see 127.out.bad)
--- 127.out 2014-07-17 15:45:42.695156001 -0700
+++ 127.out.bad 2014-07-17 16:05:51.259156001 -0700
@@ -6,8 +6,180 @@
 === FSX Standard Mode, No Memory Mapping ===
 All operations completed A-OK!
 === FSX Standard Mode, Memory Mapping ===
-All operations completed A-OK!
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 fsx_std_mmap
+READ BAD DATA: offset = 0x26627, size = 0x1167, fname = /tank/xfstests/fsx_std_mmap
+OFFSET GOOD    BAD RANGE
+0x26627    0x0000  0xf514  0x    0
+operation# (mod 256) for the bad data may be 20
+0x26628    0x0000  0x1477  0x    1
+operation# (mod 256) for the bad data may be 20
+0x26629    0x0000  0x7714  0x    2
+operation# (mod 256) for the bad data may be 20
+0x2662a    0x0000  0x143b  0x    3
+operation# (mod 256) for the bad data may be 20
+0x2662b    0x0000  0x3b14  0x    4
+operation# (mod 256) for the bad data may be 20
+0x2662c    0x0000  0x144b  0x    5
+operation# (mod 256) for the bad data may be 20
+0x2662d    0x0000  0x4b14  0x    6
+operation# (mod 256) for the bad data may be 20
+0x2662e    0x0000  0x14f6  0x    7
+operation# (mod 256) for the bad data may be 20
+0x2662f    0x0000  0xf614  0x    8
+operation# (mod 256) for the bad data may be 20
+0x26630    0x0000  0x140c  0x    9
+operation# (mod 256) for the bad data may be 20
+0x26631    0x0000  0x0c14  0x    a
+operation# (mod 256) for the bad data may be 20
+0x26632    0x0000  0x1450  0x    b
+operation# (mod 256) for the bad data may be 20
+0x26633    0x0000  0x5014  0x    c
+operation# (mod 256) for the bad data may be 20
+0x26634    0x0000  0x1495  0x    d
+operation# (mod 256) for the bad data may be 20
+0x26635    0x0000  0x9514  0x    e
+operation# (mod 256) for the bad data may be 20
+0x26636    0x0000  0x1481  0x    f
+operation# (mod 256) for the bad data may be 20
+LOG DUMP (49 total operations):
+1(  1 mod 256): MAPWRITE 0x3db39 thru 0x3ffff  (0x24c7 bytes)
+2(  2 mod 256): MAPREAD  0x2e947 thru 0x33163  (0x481d bytes)
+3(  3 mod 256): READ     0x2e836 thru 0x3cba1  (0xe36c bytes)
+4(  4 mod 256): PUNCH    0x2e7 thru 0x5c42 (0x595c bytes)
+5(  5 mod 256): MAPWRITE 0xcaea thru 0x13ba9   (0x70c0 bytes)
+6(  6 mod 256): PUNCH    0x31645 thru 0x38d1c  (0x76d8 bytes)
+7(  7 mod 256): SKIPPED (no operation)
+8(  8 mod 256): MAPREAD  0x2ed6b thru 0x3413b  (0x53d1 bytes)
+9(  9 mod 256): READ     0x907c thru 0xf535    (0x64ba bytes)
+10( 10 mod 256): WRITE    0x37786 thru 0x3b8e0 (0x415b bytes)
+11( 11 mod 256): MAPREAD  0x3a290 thru 0x3ffff (0x5d70 bytes)
+12( 12 mod 256): READ     0x22f86 thru 0x2d9aa (0xaa25 bytes)  ***RRRR***
+13( 13 mod 256): SKIPPED (no operation)
+14( 14 mod 256): WRITE    0x32ae3 thru 0x3a478 (0x7996 bytes)
+15( 15 mod 256): MAPWRITE 0x5acb thru 0x843a   (0x2970 bytes)
+16( 16 mod 256): SKIPPED (no operation)
+17( 17 mod 256): SKIPPED (no operation)
+18( 18 mod 256): READ     0x279cd thru 0x28094 (0x6c8 bytes)
+19( 19 mod 256): TRUNCATE DOWN from 0x40000 to 0x61da  ******WWWW
+20( 20 mod 256): MAPWRITE 0x2517c thru 0x2d393 (0x8218 bytes)  ******WWWW
+21( 21 mod 256): READ     0x293fe thru 0x2d393 (0x3f96 bytes)
+22( 22 mod 256): MAPWRITE 0x27b51 thru 0x2e5a2 (0x6a52 bytes)
+23( 23 mod 256): TRUNCATE UP   from 0x2e5a3 to 0x2e744
+24( 24 mod 256): SKIPPED (no operation)
+25( 25 mod 256): SKIPPED (no operation)
+26( 26 mod 256): SKIPPED (no operation)
+27( 27 mod 256): MAPREAD  0xcdd4 thru 0x13e17  (0x7044 bytes)
+28( 28 mod 256): MAPWRITE 0x1a705 thru 0x1f5e0 (0x4edc bytes)
+29( 29 mod 256): MAPREAD  0x1c452 thru 0x1d63a (0x11e9 bytes)
+30( 30 mod 256): MAPWRITE 0x373fc thru 0x3760d (0x212 bytes)
+31( 31 mod 256): SKIPPED (no operation)
+32( 32 mod 256): PUNCH    0x14c02 thru 0x18bd0 (0x3fcf bytes)
+33( 33 mod 256): SKIPPED (no operation)
+34( 34 mod 256): PUNCH    0x26117 thru 0x30089 (0x9f73 bytes)  ******PPPP
+35( 35 mod 256): SKIPPED (no operation)
+36( 36 mod 256): SKIPPED (no operation)
+37( 37 mod 256): WRITE    0x3845 thru 0xc58f   (0x8d4b bytes)
+38( 38 mod 256): WRITE    0x2d2e9 thru 0x38bf7 (0xb90f bytes) EXTEND
+39( 39 mod 256): SKIPPED (no operation)
+40( 40 mod 256): WRITE    0x7a3 thru 0x3d5f    (0x35bd bytes)
+41( 41 mod 256): WRITE    0xd6eb thru 0x14efe  (0x7814 bytes)
+42( 42 mod 256): WRITE    0x201f thru 0x3bd0   (0x1bb2 bytes)
+43( 43 mod 256): SKIPPED (no operation)
+44( 44 mod 256): SKIPPED (no operation)
+45( 45 mod 256): SKIPPED (no operation)
+46( 46 mod 256): MAPREAD  0x15931 thru 0x1ce44 (0x7514 bytes)
+47( 47 mod 256): MAPREAD  0x410f thru 0xee3d   (0xad2f bytes)
+48( 48 mod 256): READ     0x2002 thru 0x8e32   (0x6e31 bytes)
+49( 49 mod 256): READ     0x26627 thru 0x2778d (0x1167 bytes)  ***RRRR***
+Correct content saved for comparison
+(maybe hexdump "/tank/xfstests/fsx_std_mmap" vs "/tank/xfstests/fsx_std_mmap.fsxgood")
 === FSX Standard Mode, No Memory Mapping ===
 All operations completed A-OK!
 === FSX Standard Mode, Memory Mapping ===
-All operations completed A-OK!
+ltp/fsx -f -q -l 262144 -o 65536 -S 191110531 -N 100000 fsx_std_mmap
+READ BAD DATA: offset = 0x26627, size = 0x1167, fname = /tank/xfstests/fsx_std_mmap
+OFFSET GOOD    BAD RANGE
+0x26627    0x0000  0xf514  0x    0
+operation# (mod 256) for the bad data may be 20
+0x26628    0x0000  0x1477  0x    1
+operation# (mod 256) for the bad data may be 20
+0x26629    0x0000  0x7714  0x    2
+operation# (mod 256) for the bad data may be 20
+0x2662a    0x0000  0x143b  0x    3
+operation# (mod 256) for the bad data may be 20
+0x2662b    0x0000  0x3b14  0x    4
+operation# (mod 256) for the bad data may be 20
+0x2662c    0x0000  0x144b  0x    5
+operation# (mod 256) for the bad data may be 20
+0x2662d    0x0000  0x4b14  0x    6
+operation# (mod 256) for the bad data may be 20
+0x2662e    0x0000  0x14f6  0x    7
+operation# (mod 256) for the bad data may be 20
+0x2662f    0x0000  0xf614  0x    8
+operation# (mod 256) for the bad data may be 20
+0x26630    0x0000  0x140c  0x    9
+operation# (mod 256) for the bad data may be 20
+0x26631    0x0000  0x0c14  0x    a
+operation# (mod 256) for the bad data may be 20
+0x26632    0x0000  0x1450  0x    b
+operation# (mod 256) for the bad data may be 20
+0x26633    0x0000  0x5014  0x    c
+operation# (mod 256) for the bad data may be 20
+0x26634    0x0000  0x1495  0x    d
+operation# (mod 256) for the bad data may be 20
+0x26635    0x0000  0x9514  0x    e
+operation# (mod 256) for the bad data may be 20
+0x26636    0x0000  0x1481  0x    f
+operation# (mod 256) for the bad data may be 20
+LOG DUMP (49 total operations):
+1(  1 mod 256): MAPWRITE 0x3db39 thru 0x3ffff  (0x24c7 bytes)
+2(  2 mod 256): MAPREAD  0x2e947 thru 0x33163  (0x481d bytes)
+3(  3 mod 256): READ     0x2e836 thru 0x3cba1  (0xe36c bytes)
+4(  4 mod 256): PUNCH    0x2e7 thru 0x5c42 (0x595c bytes)
+5(  5 mod 256): MAPWRITE 0xcaea thru 0x13ba9   (0x70c0 bytes)
+6(  6 mod 256): PUNCH    0x31645 thru 0x38d1c  (0x76d8 bytes)
+7(  7 mod 256): SKIPPED (no operation)
+8(  8 mod 256): MAPREAD  0x2ed6b thru 0x3413b  (0x53d1 bytes)
+9(  9 mod 256): READ     0x907c thru 0xf535    (0x64ba bytes)
+10( 10 mod 256): WRITE    0x37786 thru 0x3b8e0 (0x415b bytes)
+11( 11 mod 256): MAPREAD  0x3a290 thru 0x3ffff (0x5d70 bytes)
+12( 12 mod 256): READ     0x22f86 thru 0x2d9aa (0xaa25 bytes)  ***RRRR***
+13( 13 mod 256): SKIPPED (no operation)
+14( 14 mod 256): WRITE    0x32ae3 thru 0x3a478 (0x7996 bytes)
+15( 15 mod 256): MAPWRITE 0x5acb thru 0x843a   (0x2970 bytes)
+16( 16 mod 256): SKIPPED (no operation)
+17( 17 mod 256): SKIPPED (no operation)
+18( 18 mod 256): READ     0x279cd thru 0x28094 (0x6c8 bytes)
+19( 19 mod 256): TRUNCATE DOWN from 0x40000 to 0x61da  ******WWWW
+20( 20 mod 256): MAPWRITE 0x2517c thru 0x2d393 (0x8218 bytes)  ******WWWW
+21( 21 mod 256): READ     0x293fe thru 0x2d393 (0x3f96 bytes)
+22( 22 mod 256): MAPWRITE 0x27b51 thru 0x2e5a2 (0x6a52 bytes)
+23( 23 mod 256): TRUNCATE UP   from 0x2e5a3 to 0x2e744
+24( 24 mod 256): SKIPPED (no operation)
+25( 25 mod 256): SKIPPED (no operation)
+26( 26 mod 256): SKIPPED (no operation)
+27( 27 mod 256): MAPREAD  0xcdd4 thru 0x13e17  (0x7044 bytes)
+28( 28 mod 256): MAPWRITE 0x1a705 thru 0x1f5e0 (0x4edc bytes)
+29( 29 mod 256): MAPREAD  0x1c452 thru 0x1d63a (0x11e9 bytes)
+30( 30 mod 256): MAPWRITE 0x373fc thru 0x3760d (0x212 bytes)
+31( 31 mod 256): SKIPPED (no operation)
+32( 32 mod 256): PUNCH    0x14c02 thru 0x18bd0 (0x3fcf bytes)
+33( 33 mod 256): SKIPPED (no operation)
+34( 34 mod 256): PUNCH    0x26117 thru 0x30089 (0x9f73 bytes)  ******PPPP
+35( 35 mod 256): SKIPPED (no operation)
+36( 36 mod 256): SKIPPED (no operation)
+37( 37 mod 256): WRITE    0x3845 thru 0xc58f   (0x8d4b bytes)
+38( 38 mod 256): WRITE    0x2d2e9 thru 0x38bf7 (0xb90f bytes) EXTEND
+39( 39 mod 256): SKIPPED (no operation)
+40( 40 mod 256): WRITE    0x7a3 thru 0x3d5f    (0x35bd bytes)
+41( 41 mod 256): WRITE    0xd6eb thru 0x14efe  (0x7814 bytes)
+42( 42 mod 256): WRITE    0x201f thru 0x3bd0   (0x1bb2 bytes)
+43( 43 mod 256): SKIPPED (no operation)
+44( 44 mod 256): SKIPPED (no operation)
+45( 45 mod 256): SKIPPED (no operation)
+46( 46 mod 256): MAPREAD  0x15931 thru 0x1ce44 (0x7514 bytes)
+47( 47 mod 256): MAPREAD  0x410f thru 0xee3d   (0xad2f bytes)
+48( 48 mod 256): READ     0x2002 thru 0x8e32   (0x6e31 bytes)
+49( 49 mod 256): READ     0x26627 thru 0x2778d (0x1167 bytes)  ***RRRR***

dweeezil · 2014-07-18T12:19:00Z

Duplicated the failed xfstests/127 here. For some reason, I only looked into the fallocate-related xfstests and realized I couldn't use them (for a reason I already forgot). I'm looking into this one.

dweeezil · 2014-07-18T13:31:28Z

Adding a call to truncate_inode_pages_range() seems to fix it but I'd like to check whether that function is available in all kernels. I'm testing in 3.13.7 at the moment. See dweeezil/zfs@c348433.

rohan-puri · 2014-07-19T10:34:33Z

@dweeezil : I have noticed 4 issues with the fallocate(2) patch, even I had started to work on this and have a patch quite similar as yours :)

need to make use of truncate_pagecache_range() instead of truncate_inode_pages_range(), the former also unmaps the
i_mutex inode lock should be made use of to protect from beginning of the function till end against many race conditions. One example is : -
Consider 2 operations performed in following order on a file with 10 mb valid data: -
1. fallocate - punch hole from 0 to 1 mb - on-disk data blocks are freed using range-
locking of znode context switch happens.
2. read of 0 to 1 mb on same file, gets valid data form page-cache since truncate not done
till this point.
Hence, whole function scope should be i_mutex protected.
Consider a case where delalloc is in picture, data is written to a file but its in memory and no on-disk blocks are allocated, if a fallocate() request comes on this file, we need to sync the files contents to disk first, so they are allocated blocks and then fallocate() should continue, need to make use of something like filemap_write_and_wait_range() for this purpose.
Need to validate file size by making use of inode_newsize_ok().

dweeezil · 2014-07-19T14:09:37Z

@rohan-puri All very good points: I was looking at other file native file systems as a reference but I can see that what you're describing is pretty much what ext4 does. For some reason, I didn't look at ext4 very closely. Here's my thoughts on these issues:

I was going to look into this a bit further as it was obvious to me there have been a number of API changes in this area within Linux. I just looked into truncate_pagecache_range() and see it was added in 3.4 so we'll need another compat test as we currently do for truncate_setsize() but your point is correct in that we need the equivalent of unmap_mapping_range() which is what the new function gives us (in newer kernels). I'll be working on this further and likely adding another autotools test and compat wrapper.
Ext4 does exactly this and I can certainly understand the point. I think the more interesting race conditions involve concurrent operations which change the size of the file (in particular, reducing it). It's also interesting to note that the only place within zfsonlinux where the inode lock is taken is in zpl_llseek() (which is a somewhat related operation; involving holes). I'll add the lock.
I'll have to think about this further. Isn't "delalloc" mainly an ext4 feature whereas in ZFS allocations are always delayed (down to the zio layer). I'll note that zfs_freesp() (along with zfs_extend() and zfs_trunc()) all create their own transactions.
I'm wondering why since this call can't change the size. The only operation supported in zfsonlinux is hole punching which can only free space and cannot change the size of the file.

I'll try to get a refresh of this patch pushed this weekend.

dweeezil · 2014-07-19T17:01:22Z

Re-pushed with inode locking added and also with use of truncate_pagecache_range() rather thantruncate_inode_pages_range(). Tested with xfstests generic/127 under 2.6.32.61 and 3.13.7.

dweeezil · 2014-07-20T11:45:01Z

The after adding the inode lock, fsstress has uncovered a deadlock. I'm looking into it.

dweeezil · 2014-07-21T04:27:16Z

Added a missing unlock in the error case and re-pushed. No more deadlock when running fsstress.

rohan-puri · 2014-07-21T07:15:16Z

@dweeezil : Great :)

ok.
I agree, race conditions mostly related to file size are of interest.
I agree that in zfs its always delayed. I am just trying to figure out, can there be a case where file write data is still in memory, so its related blocks are not allocated & suddenly punch hole requests comes & starts executing, wont this create a problem? The thing i am trying to stress on is, punch holing on a file should be done only when its in-memory & on-disk state is consistent.
zfs_space() allocates space in the file too, so why we do not support fallocate(2) calls mode = 0x0, call? similar to other file systems? So, zfs_space() is called unconditional and in this case we will change the file size & we will need inode_newsize_ok() check.

According to the Linux documentation, the FALLOC_FL_KEEP_SIZE flag must be set in conjunction with the FALLOC_FL_PUNCH_HOLE flag. The previous logic caused a failure whenever FALLOC_FL_KEEP_SIZE was present. Also, mimic the behavior of other native file systems such as ext4 in cases where the file might be extended. If the offset is beyond the end of the file, return success without changing the file. If the extent of the punched hole would extend the file, only the existing tail of the file is punched. Add an autotools check for truncate_pagecache_range(). References: https://git.kernel.org/cgit/docs/man-pages/man-pages.git/tree/man2/fallocate.2#n102

dweeezil · 2014-07-21T18:13:23Z

Added a test to return EINVAL when the length is <= 0. I'll not have a chance to look at your points 3&4, @rohan-puri until later.

dweeezil · 2014-07-22T18:56:26Z

@behlendorf I'll try to get this re-worked and also to base it on your recent zfs_trunc() fix.

dweeezil · 2014-08-12T16:28:07Z

I'm going to close this one and re-open it with a more appropriate description once I re-work it (which should be pretty soon).

behlendorf added this to the 0.6.4 milestone Jul 17, 2014

behlendorf added Feature labels Jul 17, 2014

dweeezil closed this Aug 12, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linux semantics for fallocate() #2499

Linux semantics for fallocate() #2499

dweeezil commented Jul 15, 2014

dweeezil commented Jul 15, 2014

kernelOfTruth commented Jul 16, 2014

kernelOfTruth commented Jul 16, 2014

dweeezil commented Jul 16, 2014

dweeezil commented Jul 16, 2014

dweeezil commented Jul 16, 2014

behlendorf commented Jul 17, 2014

behlendorf commented Jul 17, 2014

dweeezil commented Jul 18, 2014

dweeezil commented Jul 18, 2014

rohan-puri commented Jul 19, 2014

dweeezil commented Jul 19, 2014

dweeezil commented Jul 19, 2014

dweeezil commented Jul 20, 2014

dweeezil commented Jul 21, 2014

rohan-puri commented Jul 21, 2014

dweeezil commented Jul 21, 2014

dweeezil commented Jul 22, 2014

dweeezil commented Aug 12, 2014

Linux semantics for fallocate() #2499

Linux semantics for fallocate() #2499

Conversation

dweeezil commented Jul 15, 2014

dweeezil commented Jul 15, 2014

kernelOfTruth commented Jul 16, 2014

kernelOfTruth commented Jul 16, 2014

dweeezil commented Jul 16, 2014

dweeezil commented Jul 16, 2014

dweeezil commented Jul 16, 2014

behlendorf commented Jul 17, 2014

behlendorf commented Jul 17, 2014

dweeezil commented Jul 18, 2014

dweeezil commented Jul 18, 2014

rohan-puri commented Jul 19, 2014

dweeezil commented Jul 19, 2014

dweeezil commented Jul 19, 2014

dweeezil commented Jul 20, 2014

dweeezil commented Jul 21, 2014

rohan-puri commented Jul 21, 2014

dweeezil commented Jul 21, 2014

dweeezil commented Jul 22, 2014

dweeezil commented Aug 12, 2014